A Connectionist Model of Phonological Representation in Speech Perception
Abstract
A number of recent studies have examined the effects of phonological variation on the perception of speech. These studies show that both the lexical representations of words and the mechanisms of lexical access are organized so that natural, systematic variation is tolerated by the perceptual system, while a general intolerance of random deviation is maintained. Lexical abstraction distinguishes between phonetic features that form the invariant core of a word and those that are susceptible to variation. Phonological inference relies on the context of surface changes to retrieve the underlying phonological form. In this paper we present a model of these processes in speech perception, based on connectionist learning techniques. A simple recurrent network was trained on the mapping from the variant surface form of speech to the underlying form. Once trained, the network exhibited features of both abstraction and inference in its processing of normal speech, and predicted that similar behavior will be found in the perception of nonsense words. This prediction was confirmed in subsequent research (Gaskell & Marslen-Wilson, 1994).
Introduction
Phonological variation is generally described in terms of a mapping from underlying to surface forms in the production of speech. For example, the word wicked, which has the canonical form /wIkId/, can be realized as [wIkIb] in the context of [wIkIbpræñk]1 (wicked prank). This is an example of place assimilation, in which the place of a syllable-final coronal segment (e.g., /t/, /d/, /n/) changes to become more like the following segment. In the above case, the place change involves closure of the lips, transforming the coronal /d/ into a labial [b]. A similar process operates on coronal segments when followed by segments with velar place, as in the assimilation of /d/ to [g] in [regkoωt] (red coat). These changes may be complete, producing tokens with no residual coronal information.
In other cases, the result of the place change is partial, producing a segment with phonetic information relating to two different places of articulation (Browman & Goldstein, 1991; Nolan, 1992; Byrd, 1992). A similar assimilation process operates on oral vowels followed by nasal segments, as in the word ban. Here, the nasality of the consonant spreads back to the vowel, producing a phonetically nasal vowel. In English, this process is allophonic, since the product of the assimilation does not cross a phonemic boundary. However, in other languages, such as Bengali, the process neutralizes phonemic distinctions. This is because such languages contain vowels which are underlyingly nasal and hence indistinguishable from assimilated oral vowels. Neutralizing processes, such as place assimilation in English or nasal assimilation in Bengali, appear problematic when we examine their effect on the perceptual system. Understanding speech involves a process of lexical access, in which a representation of the speech stream is compared to stored representations of words, which are used as the key to word meaning. The presence of fully assimilated segments of speech creates lexical ambiguity, in that the underlying identity of assimilated segments cannot be directly extracted from their phonetic properties. For example, a surface [b] could be an ordinary token of an underlying /b/, but it could also be an assimilated token of /d/, as in the wicked prank example above. Thus, an important challenge for any model of human speech perception is to explain how the system is organized to ensure that natural surface variants gain access to the stored knowledge about a word. This picture is further complicated by the intolerance we find for random deviations in lexical access. 
The research reported in this paper is described with reference to activation models of lexical access such as Cohort (Marslen-Wilson, 1987) and TRACE (McClelland & Elman, 1986). In these models the state of the lexical access process is reflected in the activations of a lexical candidate. Candidates which closely match a featural representation of a section of the speech stream will be highly active, whereas candidates which match less well will have a lower activation. Access to stored information about a word is dependent on its activation, either in absolute terms or relative to other activations. Thus, the success of the lexical access process depends critically on the goodness-of-fit between sensory evidence and stored information about the form of words. A number of recent studies using on-line priming techniques indicate that, unless the goodness of fit between speech and lexical forms is high, access to lexical information is disrupted. For example, Marslen-Wilson and Zwitserlood (1989) used cross-modal associative priming to study the effects of word-initial mismatch on lexical access. The experiments compared the priming effects of nonword rhyme primes to the effects of the source words (for example, comparing noney-BEE to honey-BEE), finding that rhyme primes were always less effective than the source words in the facilitation of the target words. Indeed, only when the competitor environment of the source word was particularly sparse was there any significant facilitation by the rhyme prime. Similar studies (Connine, Blasko, & Titone, 1993; Marslen-Wilson, Moss, & van Halen, in press) show that even single phonetic feature deviations disrupt lexical access, although in these cases the disruption may not be complete.
1 The electronic form of this paper contains some nonstandard phonetic transcriptions.
These findings suggest that natural phonological variation and random variation are treated in different ways by the perceptual system. Small random deviations, such as the phonetic changes used in the priming experiments, have a strong disruptive effect on the process of lexical access. In contrast, systematic phonetic changes, such as assimilatory change, are accommodated by the perceptual system. This contrast reflects two aspects of speech perception, which can in classical terms be separated into the process of lexical access and the representation of phonological form. We argue that lexical form representations are abstract, containing only the invariant properties of the words. In addition, a process of context-dependent phonological inference assesses the phonological viability of surface changes in their segmental context.
Lexical Abstraction
Lahiri and Marslen-Wilson (1991) argue that phonological representations in the mental lexicon are structured in accordance with the theory of radical underspecification (Archangeli, 1988). This theory states that only certain aspects of the form of a word are lexically represented. Only phonetic features which are unpredictable2 form part of the specification of a word, with all other information left unspecified. Generally, two types of argument are used to support underspecification (Keating, 1988). The first is based on transparency to phonological rules involving the spread of a harmonizing feature across intervening segments (see Hare, 1990, for a connectionist treatment of this phenomenon). The second, which we shall deal with in this paper, links underspecification to phonological variation. Underspecification can be used as a basis for the explanation of both assimilatory change in speech production and tolerance of regular variation in perception.
For example, in the representation of vowels, the feature [oral] is generally accepted as the default state and is therefore not present in an underspecified representation (Archangeli, 1984). Thus, the nasalization of vowels through assimilation is explained as a consequence of this unspecified state: the unspecified segment can gain the nasality of a following nasal, as in ban. However, for languages which contain underlyingly nasal vowels, a nasal vowel cannot assimilate to a following oral consonant, because its underlying representation is already fully specified. It is the representational asymmetry which radical underspecification provides that is crucial to the current study. However, it is important to note that other varieties of underspecification, as well as other systems making use of privative features, exhibit similar asymmetries. Here, we take radical underspecification as our exemplar of these systems. Stemberger (1991) found evidence for underspecification in the production of speech errors. Analyses of both naturally occurring and experimentally induced speech errors revealed strong asymmetries in the types of replacement errors made. This was consistent with the idea that some phonological feature values are left unspecified, and therefore prone to erroneous change during speech production, whereas other values are lexically specified and thus more resistant to change. In the perceptual domain, the underspecified representation may act more like a filter. According to Lahiri and Marslen-Wilson (1991), successful lexical access depends predominantly on the degree of match between specified features of an underspecified lexical representation and the corresponding speech input.
2 Predictable features are those that can be derived by either context-insensitive (default) or context-dependent (redundancy) rules.
Therefore, if a vowel is specified in the lexicon as nasal, a matching segment of the speech stream must also be nasal. However, if a lexical entry contains an oral vowel, which is lexically unspecified for that feature, free variation is allowed on the value of the feature: the lexical entry will match both nasal and oral speech. This matching strategy tolerates systematic phonological variation, since it affects only unspecified features, which do not figure in the goodness-of-fit calculation. On the other hand, a general intolerance of random variation is maintained, since random variation will alter specified and unspecified features alike. Evidence in support of this theory comes from a number of gating studies. Lahiri and Marslen-Wilson (1991) compared the effects of vowel nasalization on subjects' perceptions cross-linguistically, using Bengali and English. In English, vowels generally have an oral surface form, although vowel nasalization can occur as a product of assimilation, as in the ban example above. The same assimilation process occurs in Bengali, but the language also contains words which are underlyingly nasal (i.e., the difference between oral and nasal vowels is distinctive). Lahiri and Marslen-Wilson presented Bengali and English subjects with three types of consonant-vowel-consonant (CVC) words. The first group were simple CVCs with an oral vowel and an oral final consonant (e.g., /bæd/, bad). The second type of item, denoted CVN, were words ending in a nasal consonant which contained an assimilated nasalized vowel (e.g., [tãn], tan). The Bengali subjects were also presented with CVNC words containing an underlyingly nasal vowel followed by an oral consonant (e.g., /pãk/, 'slime'). As the stimuli were presented, in gradually increasing gated sections, subjects were instructed to predict the identity of the complete word. 
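The matching strategy described at the start of this passage (specified features must agree with the input, unspecified features are free to vary) can be illustrated with a minimal sketch. The feature names, the dictionary encoding, and the use of `None` for an unspecified value are assumptions made for illustration; they are not the representation used by Lahiri and Marslen-Wilson or by the model in this paper.

```python
def matches(lexical, surface):
    """Return True if every *specified* lexical feature agrees with the surface input.

    Features whose lexical value is None are treated as unspecified and are
    left out of the goodness-of-fit check (an illustrative assumption).
    """
    return all(
        surface.get(feature) == value
        for feature, value in lexical.items()
        if value is not None
    )


# Hypothetical vowel entries for a Bengali-like lexicon:
# nasal vowels are specified for nasality, oral vowels are left unspecified.
ORAL_VOWEL_ENTRY = {"nasal": None, "high": True}
NASAL_VOWEL_ENTRY = {"nasal": True, "high": True}

surface_oral = {"nasal": False, "high": True}
surface_nasal = {"nasal": True, "high": True}
surface_deviant = {"nasal": False, "high": False}  # a random deviation on a specified feature

oral_entry_accepts_nasal_input = matches(ORAL_VOWEL_ENTRY, surface_nasal)
nasal_entry_accepts_oral_input = matches(NASAL_VOWEL_ENTRY, surface_oral)
oral_entry_accepts_deviant_input = matches(ORAL_VOWEL_ENTRY, surface_deviant)
```

In this sketch the unspecified oral entry accepts both oral and nasalized input, the specified nasal entry rejects oral input, and a change to a specified feature is rejected by both entries, which mirrors the asymmetry described in the text.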
The critical data for underspecification theory are the responses during vowel presentation, where the nasality of the vowel is known but the final consonant is still unknown. For the nasal vowels, English subjects used the presence of vowel nasalization to predict CVN words, whereas Bengalis responded with a mixture of underlyingly oral and nasal words (roughly 60% of the Bengali responses were of type CVNC, 6% were of type CVN and 28% were of type CVC). For stimuli with oral vowels, Bengali subjects almost never produced words with underlyingly nasal vowels (0.7% of responses), but responded with both oral CVC (80.3%) and nasalized CVN (13.4%) words. These data reflect two aspects of the internal representation of phonological form. The fact that subjects are able to map surface oral vowels onto the lexical representations of CVN words, which have surface nasalized vowels, suggests that the representation on which their responses are based is an abstract underlying representation, which does not encode the surface nasality of these vowels. Thus, assimilatory nasalization does not discriminate directly between lexical candidates because the representation on which the discrimination process is based does not encode such a surface change.3 In addition, the asymmetry found in the Bengali data, whereby surface nasal vowels can match lexical representations of words with oral vowels, but surface oral vowels mismatch specified nasal representations, indicates that this underlying representation is underspecified, with only nasal vowels actually represented in the lexicon. A further study of neutralizing assimilation, comparable to the Bengali data, examined the representation of stop consonants in speech perception. Coronal place of articulation is generally regarded as the default place (cf. Paradis & Prunet, 1991) and is therefore absent from an underspecified representation of place.
Again, this agrees with the susceptibility of coronals to place assimilation as described above and predicts a further asymmetry in the perceptual domain. This prediction was tested in a gating study using English (Marslen-Wilson & Nix, 1992; Nix, Gaskell & Marslen-Wilson, 1993). The study compared sentences such as (1) and (2) below, which contained words with coronal or non-coronal word-final stops.
(1) They thought the lake cruise was rather boring
(2) They thought the late train was rather boring
In (1) the critical word (lake) has a surface velar segment, but is underlyingly ambiguous: the underlying word could indeed be a token of lake but it could also be late, which has undergone a process of assimilation. However, there is no such ambiguity in (2), since assimilation from non-coronal to coronal place does not occur. For example, the /k/ in lake cannot assimilate to the /t/ in train to give late train. The only phonologically viable reading of the critical word is therefore late. These sentences were presented to subjects in gradually increasing gated sections. At the offset of the critical word (i.e., late), the sentences of type (2) were fairly unambiguous, with 80% of subjects giving a response containing a word-final coronal. However, the type (1) words provoked a mixture of responses, with only 55% of responses conforming to the surface structure of the critical word. This pattern of responses is similar to the nasality data of Lahiri and Marslen-Wilson. Although words like late or lake may be unambiguous when heard in isolation, they show a strong asymmetry in their perceptual quality when presented in normal utterance context. The lexical entries for words such as lake, with non-coronal final segments, are specified for place and so will not match any speech input (such as late) with a different place of articulation. This explains the lack of ambiguity for the sentences of type (2).
On the other hand, lexical entries for words such as late are unspecified for place and so will match with both coronal and non-coronal segments of speech. Taken as a whole, these results form strong evidence for an asymmetrical abstractness in the lexical representations of underlying form. In this paper we look at how this system might develop. We argue that such asymmetries develop as a response to the variability in the mapping between surface and underlying form. For speakers of Bengali, surface orality is a reliable indicator of an underlying oral vowel and is treated as such. On the other hand, surface nasality, since it can occur either through assimilation of an oral vowel or as the surface form of a nasal vowel, is treated more ambiguously. Similarly, for English speakers coronal surface segments are reliable indicators of an underlying coronal segment and are treated as such during speech perception. Labial and velar surface forms, since they can occur as the product of the assimilation of an underlying coronal segment, must be treated as more ambiguous, at least until their following context is known.
3 Nasalization can be used indirectly to discriminate between candidates because it allows the listener to predict, in languages like English, that the following consonant is nasal.
Phonological Inference
There is evidence that not only the form of lexical representations, but also the process of lexical access is organized to accommodate regular variation. Phonological inference is a process which analyses segments with reference to their phonological context in order to elucidate their underlying identity. This is essentially the opposite mapping to that assumed to take place in speech production. In order to carry out this process, the perceptual system must be sensitive to the rules or constraints under which regular phonological changes take place.
Using again the example of place assimilation, phonological inference operates as a regressive process, checking the viability of a phonological change by comparison with its following context. Thus, the /d/ underlying the surface [b] in [wIkIbpræñk] (underlyingly wicked prank) can be inferred from the place of the following [p]: the segments both have a surface labial place of articulation and so conform to the phonological rule describing the assimilatory process.4 Experimental evidence for this inference process comes from a study of place assimilation (Gaskell & Marslen-Wilson, in press). The study examined the viability of phonological changes by comparing stimuli such as [wIkIbpræñk] with phonological changes which violate assimilation rules (e.g., [wIkIbgeIm] — wickib game — where the labial place of the [b] conflicts with the velar [g], and thus the [b] cannot be an assimilated surface form of an underlying /d/). These stimuli were embedded in sentential context and were used as primes in a cross-modal repetition priming experiment (cf. Marslen-Wilson, Tyler, Waksler & Older, 1994). The visual target, which was the canonical form of the phonologically changed word (i.e., WICKED) was presented at the offset of the phonological change (i.e., at the offset of the [b]) and the time taken to make a lexical decision to the target word was measured. The response times were an indication of the degree to which the lexical entry of the target word was accessed by the phonologically changed prime. The results showed that an assimilated prime, in a context that validated the assimilation, primed responses as strongly as a canonical token of the prime (e.g., [wIkIdpræñk]). However, the same phonological change embedded in an unviable context for assimilation strongly reduced the priming effect. This contextual effect shows that the process of lexical access is sensitive to viability in its assessment of regular variation. 
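The regressive viability check described above can be sketched as a small rule-based function. This is only an illustration of the inference being described, not the mechanism proposed in this paper (which is learned by a network rather than stated as rules); the segment inventory, the `PLACE` table, and the `CORONAL_COUNTERPART` table are simplified assumptions.

```python
# Illustrative place-of-articulation table (hypothetical and far from complete).
PLACE = {
    "t": "coronal", "d": "coronal", "n": "coronal",
    "p": "labial", "b": "labial", "m": "labial",
    "k": "velar", "g": "velar",
}

# Coronal segment that each non-coronal could have come from (same voicing and manner).
CORONAL_COUNTERPART = {"p": "t", "b": "d", "m": "n", "k": "t", "g": "d"}


def underlying_candidates(segment, following):
    """Return the underlying segments a word-final surface segment could stand for.

    A surface labial or velar is also compatible with an underlying coronal,
    but only if the following segment shares its place of articulation.
    """
    candidates = [segment]  # a surface segment can always be a token of itself
    place = PLACE.get(segment)
    if (
        place in ("labial", "velar")
        and PLACE.get(following) == place
        and segment in CORONAL_COUNTERPART
    ):
        candidates.append(CORONAL_COUNTERPART[segment])
    return candidates


viable = underlying_candidates("b", "p")    # [b] before [p], as in wickib prank
unviable = underlying_candidates("b", "g")  # [b] before [g], as in wickib game
canonical = underlying_candidates("d", "p") # unassimilated [d] before [p]
```

Under these assumptions, the surface [b] before [p] remains compatible with an underlying /d/, whereas the same [b] before [g] does not, which is the contrast the priming experiment tests.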
This effect, since it occurs between words, does not reflect an aspect of the lexical representation of words. Instead it indicates a process of inference which operates on the form representations, either pre-lexically or after initial contact with the lexicon. Phonological inference provides additional constraints on the mapping of speech input onto highly abstract lexical form representations, which under a simple mapping process would match a large number of surface forms.
4 Such sequences do not always mark assimilation points (for example, the same sequence of segments occurs across the word boundary in club player). Often this kind of ambiguity can be resolved using lexical information (i.e., club is a word, but clud is not).
Modeling Abstraction and Inference
In the remainder of this paper we shall present a model of abstraction and inference in speech perception. Our primary aim is to construct a model of speech perception which accommodates the existing experimental data in this area. We shall also show that the connectionist learning approach provides a plausible account of the formation of such a mapping, based on the statistical properties of speech. Finally, we shall examine the consequences of such an approach, showing that the perception of phonologically changed speech can be described in terms of the satisfaction of multiple lexical and phonological constraints. This, in turn, leads to testable (and in some cases tested) claims about the way phonologically changed speech is perceived. The classic connectionist model of speech perception is the TRACE model of McClelland and Elman (1986). TRACE is a localist network employing interactive activation and competition between nodes to examine the matching process between speech input and lexical and phonemic candidates. However, TRACE is inappropriate for the purposes of this study for a number of reasons.
First, we wish to examine how the perceptual system may develop given variable speech input, whereas the architecture of TRACE prohibits learning. Second, we have argued that TRACE is unable to model the results of the priming experiments examining the perception of place assimilated speech (Gaskell & Marslen-Wilson, in press). Specifically, the process of lateral inhibition, employed by TRACE to reduce activations is unable to model the swift, powerful effects of mismatch found in these and other priming studies (see also Marslen-Wilson, 1993). Instead, the model we describe is a simple recurrent network (Elman, 1990; Norris, 1990), based on the backpropagation algorithm (Rumelhart, Hinton & Williams, 1986). Backpropagation allows the modification of links between nodes, by comparing the network's output to a set of training values. This enables the structure of the mapping to develop as a reflection of the network's experience. The simple recurrent network is an extension of this general method, incorporating time-delayed links between hidden units to allow generalizations to be made over temporal sequences of patterns. This is of particular value in the modeling of processes involving speech, where information is continuous and perceptual mechanisms operate on-line, often based on partial information. Norris (1990, 1992) used a simple recurrent network to examine the perceptual mapping between a featural representation of speech and localist lexical units. The performance of the network broadly corresponded to the Cohort model (Marslen-Wilson, 1987). In the early stages of the presentation of a word, all the matching candidates became active to some extent; but as soon as the uniqueness point5 of a word was reached, the activations of the mismatching candidates dropped sharply and the activation of the remaining matching candidate rose to nearly 1 (the maximum activation). 
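The time-delayed hidden-to-hidden links that distinguish a simple recurrent network can be made concrete with a short forward-pass sketch. It shows only how the previous hidden state feeds into the current step; training by backpropagation is omitted. The layer sizes, the weight range, the absence of bias terms, and the class and method names are all assumptions for illustration, not details of the networks used by Norris or in this paper.

```python
import math
import random


def sigmoid(x):
    """Logistic activation, giving unit activations between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


class SimpleRecurrentNetwork:
    """Forward pass of an Elman-style network (untrained, illustrative only)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_in = matrix(n_hidden, n_in)           # input -> hidden
        self.w_context = matrix(n_hidden, n_hidden)  # previous hidden -> hidden
        self.w_out = matrix(n_out, n_hidden)         # hidden -> output
        self.context = [0.0] * n_hidden              # copy of the previous hidden state

    def reset(self):
        """Clear the context, e.g. at the start of a new utterance."""
        self.context = [0.0] * len(self.context)

    def step(self, features):
        """Process one feature bundle and return the output activations."""
        hidden = [
            sigmoid(
                sum(w * x for w, x in zip(row_in, features))
                + sum(w * c for w, c in zip(row_ctx, self.context))
            )
            for row_in, row_ctx in zip(self.w_in, self.w_context)
        ]
        output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in self.w_out]
        self.context = hidden  # time-delayed link: becomes input to the next step
        return output


net = SimpleRecurrentNetwork(n_in=3, n_hidden=4, n_out=2)
first = net.step([1.0, 0.0, 0.0])
second = net.step([1.0, 0.0, 0.0])  # same input, different context
```

Because the context differs between the two calls, the same input bundle yields different outputs, which is what allows such a network to respond to a segment in the light of what preceded it.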
Norris also found that the network, like humans, was intolerant of small deviations (Marslen-Wilson, 1993). For example, given the nonword input horonet (where coronet is a member of the training set), the activation of coronet never rose above 0.1. Simulations in our laboratory using a similar network architecture show intolerance of nonword deviation even at the end of a word (as in coronef for coronet). Shillcock and colleagues (Shillcock, Levy & Chater, 1991; Shillcock, Lindsey, Levy & Chater, 1992) examined the process of lexical access using a simple recurrent network trained to map from a segmented stream of phonemes to an output window, consisting of the network's evaluation of the current, the previous and the next phoneme. This simple architecture allowed the network to learn generalizations based on the statistics of the speech stream, using these generalizations to improve the network's performance in the prediction of upcoming phonemes. Effects normally explained as top-down or lexical, such as the Ganong effect (Ganong, 1980)6, could thus be simulated in a bottom-up model with no explicit lexical level of representation (see also Norris, 1993).
5 This is the point at which the set of lexical candidates matching a sequence of speech sounds reduces to one word.
6 The Ganong effect refers to the experimental demonstration of word context influencing the perception of ambiguous segments by shifting the category boundary in voicing continua.
Following Shillcock et al. (1991, 1992), the approach we shall take is to impose on the network the minimum level of external structure necessary to model experimental findings about the perception of phonologically variant speech. For this reason, the network we describe does not incorporate explicit word or morpheme units in its output.
Instead, we examine the extent to which phonological inference and abstraction in speech perception can be modeled simply by exploiting the surface properties of speech. The network is trained on the mapping from a systematically variant phonetic-feature representation of speech (i.e., the surface form of speech) to a canonical representation of the same speech (i.e., the underlying form). The model incorporates no explicit concept of a lexical entry, not because we believe phonological inference is independent of lexical access, but because we wish to determine the extent to which perceptual behavior can be explained as the result of a simple statistical learning mechanism, applied to the speech stream.7 Although the model may contain no explicit lexical entries, lexical form information is used in training the network. Training involves presentation of the underlying form of speech to the network as a standard by which the error of the network can be measured and reduced, and this information can only be gained if the canonical lexical forms of the words are available. As a developmental model of phonological processing in speech perception, this implies that lexical access must be successful as the phonologically variant speech is heard, so that the underlying form can be recovered and utilized. However, the research of Gaskell and Marslen-Wilson (in press) suggests that lexical access will only be completely successful if the phonological inference mechanism is already at work. A solution to this "boot-strapping" problem may be that, as perceptual mechanisms develop, the tolerance for error in the matching process is reduced. In other words, the process of learning to understand speech may involve a gradual tightening of the constraints involved in the goodness-of-fit computation in lexical access. 
This would allow access to lexical information for phonologically variant words early on in development and so enable phonological compensation processes to be learned. It is also possible that the kind of speech a young child is exposed to may contain fewer assimilated tokens, since a greater proportion of words will be spoken in isolation and utterances may be spoken more carefully (Kerswill, 1985, showed that the occurrence and extent of place assimilation depends strongly on factors such as speech rate and style). A further implicit assumption of our model is that the unassimilated form of a word can be identified as the underlying form and then employed in the perception of assimilated forms. This identification may be the result of a frequency comparison between the different tokens of a word. Since place assimilation only occurs in a tightly constrained phonological context and even then only optionally, most coronal segments will be phonetically realized in their unassimilated form. Therefore the most frequent form of a word can be taken as the underlying form and used in the perception of assimilated forms. As pointed out above, these frequency proportions may be even more skewed in favor of unassimilated forms in the speech encountered by young children. This view of the formation of phonological representations also has implications for the role of the perceptual system in the diachronic shaping of natural phonological changes. If the perceptual system for speech develops in the manner implied here, by gradual learning from experience, it follows that speech perception has a fairly passive role in the shaping of phonological rules. A connectionist network basically learns what it is trained to learn: there may be differences in the ease of learning different mappings (e.g., due to frequency differences between the different types of mapping), but there is no reason why, for example, assimilations of coronals to non-coronals should be learned more easily than assimilations in the opposite direction. This contradicts theories (e.g., Kohler, 1990; Ohala, 1990) in which phonological change is seen as perceptually tolerated simplification.8
To facilitate comparison with experimental results, we examine the response of the network to place assimilated speech. Thus, the input to the network is a stream of feature bundles, corresponding to segments of speech, in which place assimilations are artificially introduced. During training, the network's output is modified by comparison to the canonical underlying form of this speech, again represented using phonetic features. We predict that the network will learn the phonological regularities underlying place assimilation and apply them to novel combinations of words, exhibiting a process of phonological inference. We also predict that the network will learn to develop different tolerance levels for different place feature values. Such behavior is functionally equivalent to an underspecified representation of place of articulation (cf. Seidenberg, 1994).
7 It is possible that the structure of the speech on which the network is trained will influence the development of the network's internal representations. This may result in implicit lexical (i.e., word-form) representations being developed.
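The construction of training pairs, with place assimilations artificially introduced into the input and the canonical form kept as the target, might be sketched as follows. This is a symbolic simplification: the paper describes feature bundles rather than segment symbols, and the segment tables, the function name `make_training_pair`, and the `probability` parameter are hypothetical. The optionality of assimilation comes from the text; the rate at which it was applied is not given in this section.

```python
import random

# Illustrative place table for word-initial segments (hypothetical, incomplete).
PLACE = {
    "t": "coronal", "d": "coronal", "n": "coronal",
    "p": "labial", "b": "labial", "m": "labial",
    "k": "velar", "g": "velar",
}

# Surface realization of a word-final coronal before a labial or velar segment.
ASSIMILATED = {
    ("t", "labial"): "p", ("d", "labial"): "b", ("n", "labial"): "m",
    ("t", "velar"): "k", ("d", "velar"): "g",
}


def make_training_pair(words, probability, rng):
    """Return (surface, underlying) word lists for one utterance.

    Each word-final coronal followed by a labial- or velar-initial word is
    assimilated with the given probability; all other segments are unchanged.
    """
    surface = []
    for i, word in enumerate(words):
        next_place = PLACE.get(words[i + 1][0]) if i + 1 < len(words) else None
        replacement = ASSIMILATED.get((word[-1], next_place))
        if replacement is not None and rng.random() < probability:
            word = word[:-1] + replacement
        surface.append(word)
    return surface, list(words)


rng = random.Random(0)
always = make_training_pair(["wIkId", "prank"], 1.0, rng)
never = make_training_pair(["wIkId", "prank"], 0.0, rng)
no_context = make_training_pair(["leIk", "treIn"], 1.0, rng)  # non-coronal final: no change
```

Only coronals appear as keys in the assimilation table, so a word such as lake is never altered before train, which builds the directional asymmetry of the process into the training data.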
Publication date: 1995